Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit handling of images processing and other fixes #2143

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

benoit74
Copy link
Contributor

@benoit74 benoit74 commented Jan 28, 2025

This is kinda a significant PR to fix many issues around images processing.

Fix #2140
Fix #2136
Fix #2088
Fix #2138

Changes

  • when processing HTML, make a distinction between images, videos and other medias (PDF, ...)
    • store this information in redis for later retrieval
    • do not guess if a given media is an image based on its URL anymore
    • do not guess content-type from image URL or from response header, compute it with file-types based on real image data
    • remove corresponding functions isImageUrl, getMimeType and constants: IMAGE_URL_REGEX
    • remove now useless mime-type package
    • for now, videos do not have any special treatment (e.g. reencoding) but everything is ready for that
    • other medias are always simply downloaded
  • do not add a .webp suffix to the path of images which have been converted to webp
  • stop pushing content-type to S3 metadata
    • we do not need this information anymore
    • there are too many risks this information is wrong due to a bug
    • we can let things already in S3 with this metadata live as they are, there is mostly 0 consequences
  • define a clear API of information returned by downloader.downloadContent when downloading content, instead of the whole response upstream (which could contain "anything")

@benoit74 benoit74 self-assigned this Jan 28, 2025
@kelson42
Copy link
Collaborator

kelson42 commented Jan 28, 2025

@benoit74 Just to be clear, glad to see you working on the issue, but I don't think put webp content in path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO. Current approach works (modulo bugs - like always) and if we really want to do better we should keep track about the content mime-type (instead of relying on the extension).

@benoit74
Copy link
Contributor Author

Just to be clear, glad to see you working on the issue, but I don't think put webp content in path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO.

I agree, but this would mean a significant redesign of the scraper: with current architecture, as stated in the issue, we cannot know at HTML rewriting time what the result of image download/conversion will be ; for this we need to download the image and try the reencoding, which is currently done at a totally different stage.

For now I prefer to have a scraper producing working ZIMs under all conditions with some semantic incoherence invisible to 99% of our users, rather than having non-working ZIMs like #2088. I do not mind to open an issue to fix this semantic incoherence on the medium / long term. For the record, this semantic incoherence is already present since "forever" in S3 keys used to cache image and we lived pretty well with it.

@benoit74
Copy link
Contributor Author

Sample ZIMs:

@kelson42
Copy link
Collaborator

For now I prefer to have a scraper producing working ZIMs under all conditions with some semantic incoherence invisible to 99% of our users, rather than having non-working ZIMs like #2088. I do not mind to open an issue to fix this semantic incoherence on the medium / long term. For the record, this semantic incoherence is already present since "forever" in S3 keys used to cache image and we lived pretty well with it.

It's not and should be any incoherence in S3 because the entry is tagged "webp" AFAIK.

@benoit74
Copy link
Contributor Author

It's not and should be any incoherence in S3 because the entry is tagged "webp" AFAIK.

S3 key is computed from online URL directly without any logic handling webp conversion:

await this.s3.uploadBlob(stripHttpFromUrl(url), mwResp.data, etag, mwResp.headers['content-type'], this.webp ? 'webp' : '1')

Copy link

codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 75.55556% with 33 lines in your changes missing coverage. Please review.

Project coverage is 75.19%. Comparing base (75cf0fe) to head (88bbeec).

Files with missing lines Patch % Lines
src/Downloader.ts 53.06% 19 Missing and 4 partials ⚠️
src/mwoffliner.lib.ts 60.00% 2 Missing ⚠️
src/renderers/abstract.renderer.ts 94.73% 2 Missing ⚠️
src/util/misc.ts 0.00% 2 Missing ⚠️
src/MediaWiki.ts 66.66% 1 Missing ⚠️
src/S3.ts 0.00% 1 Missing ⚠️
src/util/articleListMainPage.ts 0.00% 1 Missing ⚠️
src/util/saveArticles.ts 95.83% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (75.55%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
- Coverage   75.98%   75.19%   -0.79%     
==========================================
  Files          41       41              
  Lines        3202     3213      +11     
  Branches      706      704       -2     
==========================================
- Hits         2433     2416      -17     
- Misses        655      676      +21     
- Partials      114      121       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants